layout: true
In this course, we will provide an introduction to the basic concepts and functionalities of R and go through a prototypical data analysis workflow: import, wrangling, exploration, (basic) analysis, and reporting.
By the end of this course you should…
Note: This is not a statistics workshop. Our focus will be on learning how to use R.
The only way to write good code is to write tons of shitty code first. Feeling shame about bad code stops you from getting to good code
— Hadley Wickham (@hadleywickham) April 17, 2015
.large[ - Working versions of R (>= version 4.0.0) and RStudio on your computer
Prior experience with quantitative data analysis, basic statistics, and regression
Experience with using other statistical packages (e.g., SPSS or Stata) is helpful ]
.small[ - Senior researcher in the team Data Augmentation at the GESIS department Survey Data Curation and (co-)leader of the team Research Data & Methods at the Center for Advanced Internet Studies (CAIS)
Main areas:
Ph.D. in Psychology, University of Cologne
Previously worked in several research projects investigating the use and effects of digital media (Cologne, Hohenheim, Münster, Tübingen)
Other research interests
johannes.breuer@gesis.org | [@MattEagle09](https://twitter.com/MattEagle09) | personal website ]
.pull-left[ ]
.pull-right[
- Postdoctoral researcher in the team Data Augmentation at the GESIS department Survey Data Curation
- Ph.D. in social sciences, University of Cologne
]
.small[ stefan.juenger@gesis.org | [@StefanJuenger](https://twitter.com/StefanJuenger) | https://stefanjuenger.github.io]
R journeys
.small[
Johannes
- was socialized with SPSS
- was annoyed with AMOS when learning structural equation modeling (around 2011)
- decided to learn to use the lavaan package for R instead of MPlus to avoid being dependent on yet another proprietary software package
- attended an introductory Data analysis with R course at GESIS in 2012
- only used R for SEM for some time, while still doing everything else (esp. data wrangling) with SPSS
- finally made the full transition to R when joining GESIS in 2017

Stefan
- learned statistical "programming" when SPSS was still the major player in town
- got hooked by R somewhere around 2008 or 2009 because of the plots
- wrote horrible code and estimated multilevel models that took forever to be estimated
- switched to R for geospatial data in 2015, wrote his first (bad) R package for geo-stuff
- tried Python, uses Python occasionally, but is forever in love with R ❤️
]
What's your name?
Where do you work/study? What are you working on/studying?
What is your experience with R or other programming languages?
What statistical software package(s) do you typically use?
What do you want to use R for?
Please try to keep it short (3 to 4 sentences or ~30 secs).
The workshop consists of a combination of short lectures and hands-on exercises
For the time after the workshop sessions each day, we have also prepared some optional (and hopefully fun) "extracurricular activities"
Slides and other materials are available at
.center[https://github.com/jobreu/r-intro-gesis-2021]
If possible, we invite you to turn on your camera
If you have an immediate question during the lecture parts, please send it via text chat
If you have a question that is not urgent and might be interesting for everybody, you can also use audio (& video) to ask it at the end of a lecture part or during the exercises
We will try to provide (one-on-one) "tech support" during the exercises
We would also kindly ask you to mute your microphones when you are not asking (or answering) a question
| Day | Time | Topic |
|---|---|---|
| Monday | 10:30 - 11:30 | Getting Started with R and RStudio |
| Monday | 11:30 - 11:45 | Break |
| Monday | 11:45 - 12:45 | Getting Started with R and RStudio |
| Monday | 12:45 - 13:45 | Lunch Break |
| Monday | 13:45 - 15:00 | Data Import & Export |
| Monday | 15:00 - 15:15 | Break |
| Monday | 15:15 - 16:30 | Data Import & Export |
| Day | Time | Topic |
|---|---|---|
| Tuesday | 10:00 - 11:15 | Data Wrangling - Basics |
| Tuesday | 11:15 - 11:30 | Break |
| Tuesday | 11:30 - 12:45 | Data Wrangling - Basics |
| Tuesday | 12:45 - 13:45 | Lunch Break |
| Tuesday | 13:45 - 15:00 | Data Wrangling - Advanced |
| Tuesday | 15:00 - 15:15 | Break |
| Tuesday | 15:15 - 16:30 | Data Wrangling - Advanced |
| Wednesday | 10:00 - 11:15 | Exploratory Data Analysis |
| Wednesday | 11:15 - 11:30 | Break |
| Wednesday | 11:30 - 12:45 | Exploratory Data Analysis |
| Wednesday | 12:45 - 13:45 | Lunch Break |
| Wednesday | 13:45 - 15:00 | Data Visualization - Part 1 |
| Wednesday | 15:00 - 15:15 | Break |
| Wednesday | 15:15 - 16:30 | Data Visualization - Part 1 |
| Day | Time | Topic |
|---|---|---|
| Thursday | 10:00 - 11:15 | Confirmatory Data Analysis |
| Thursday | 11:15 - 11:30 | Break |
| Thursday | 11:30 - 12:45 | Confirmatory Data Analysis |
| Thursday | 12:45 - 13:45 | Lunch Break |
| Thursday | 13:45 - 15:00 | Data Visualization - Part 2 |
| Thursday | 15:00 - 15:15 | Break |
| Thursday | 15:15 - 16:30 | Data Visualization - Part 2 |
| Friday | 10:00 - 11:15 | Reporting with R Markdown |
| Friday | 11:15 - 11:30 | Break |
| Friday | 11:30 - 12:45 | Reporting with R Markdown |
| Friday | 12:45 - 13:45 | Lunch Break |
| Friday | 13:45 - 15:00 | Advanced Use of R, Outlook, Q&A |
| Friday | 15:00 - 15:15 | Break |
| Friday | 15:15 - 16:30 | Advanced Use of R, Outlook, Q&A |
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS (R Project website).
R is free and open-source software (FOSS) and also a programming language. More specifically, it is a free, non-commercial implementation of the [S programming language](https://en.wikipedia.org/wiki/S_(programming_language)) (developed by Bell Laboratories).
R was created by Ross Ihaka and Robert Gentleman at the Department of Statistics at the University of Auckland (NZ) in 1993
The R Core Group that has been responsible for the development of R since then and CRAN were founded in 1997
version 1.0.0 of R was released in 2000
RStudio was initially released in 2011
today (August 2, 2021) we are at version 4.1.0 (version 4.1.1 is scheduled for Aug 10, 2021)
If you want to know a bit more about the origins and history of R as well as the philosophy behind it, the book R Programming for Data Science by Roger D. Peng provides a good summary. Alternatively, you can also watch this YouTube video in which David Smith talks about Twenty Years of R.
R origins
A CD-ROM of the original version of R, signed by the R Core Team #fangirling #gbifbid pic.twitter.com/SnNiZpHXqi
— Dr. Hannah Owens (@HannahOish) September 2, 2018
R keeps on expanding
R user groups around the globe
You can use R to…
R memes
Some of the things you can do and create with R include…
You can download R via the R Project website. The exact installation process depends on your operating system (OS). The R Cookbook provides a detailed explanation of the installation process for Windows, macOS, and Linux/Unix.
If you want or need to update your version of R, you can do this the same way as for the first-time installation. If you use Windows, you can also use the installr package to update R (we will talk about packages in a bit).
R comes with a basic GUI (on Windows you can access it by opening the Rgui.exe file). However, it is quite limited in terms of its functionalities.
Using an IDE provides several advantages, such as:
syntax highlighting
auto-completion
better overview of files, libraries, created objects/output
RStudio is the most widely used IDE for R.[1] In addition to the general advantages of an IDE, it has some specific ones:
easy integration with version control via Git (for a good tutorial on this, see Happy Git and GitHub for the useR)
interfaces to Python via the reticulate package and SQL, e.g., via the dbplyr package
possibility to install and use addins that extend the functionalities of the RStudio GUI (for an overview of RStudio addins, you can check out this curated list by Dean Attali)
new versions also include (live) spellchecking features and a visual editor for R Markdown
.footnote[ [1] There are, of course, other IDEs that can be used with/for R. Another popular option is Visual Studio from Microsoft (for which an R extension is available).]
You can download the installer for your OS from the RStudio website. The R Cookbook also provides some more details on how to install and start RStudio.
When you open RStudio for the first time it should look like this (only in white instead of black and maybe not with R startup messages in German):
R console in RStudio
The console is the interactive input-output window of RStudio. You can enter commands here and press Enter to execute them. Typically, the output of the commands you enter into the console will also be displayed here.
If you see the > in the console, it means that it is ready to receive commands.
If you see a + at the beginning of the console input line, this means that the command is incomplete. A common reason for this is a missing ) or ". If you see the + at the beginning of the console input line, you can either complete the command (and then run it by pressing Enter/Return) or abort entering the command by pressing Esc.
Once you have executed at least one command in the console, you can cycle through previous ones using the ↑ and ↓ keys on your keyboard.
R as a calculator
The simplest thing you can do with the R console is to use it as a calculator.
```r
3+2
## [1] 5
2^3
## [1] 8
1/3
## [1] 0.3333333
```
Note: In the console, you won't see the ## in the output. The [1] before the result indicates that this is the first output value of the command (more complex commands can have more than one output value).
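To see more than one bracketed index, you can run a command that returns enough values to wrap onto several output lines. The following is only a small sketch (the object name `many_values` is made up for illustration); each printed line then starts with the index of its first value.

```r
# create a sequence of 30 values: 10, 20, ..., 300
many_values <- seq(from = 10, to = 300, by = 10)

# print them; how many values fit on one line depends on the width of your console
many_values
```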
```r
100^3
## [1] 1e+06
1/2500
## [1] 4e-04
```
For printing very small and very large numbers, R uses scientific notation. If you want to avoid this, you can use the command options(scipen=999). NB: This setting will only be active for the current session.
```r
options(scipen=999)
100^3
## [1] 1000000
1/2500
## [1] 0.0004
```
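If you only want to control the notation for a single value rather than for the whole session, base R's `format()` function has a `scientific` argument. A minimal sketch (the object names are chosen for illustration only):

```r
big_number <- 100^3

# format() returns a character string and leaves the session-wide options untouched
fixed_notation <- format(big_number, scientific = FALSE)
sci_notation <- format(big_number, scientific = TRUE)

fixed_notation
sci_notation
```

Note that the result of `format()` is text, so it is meant for printing and reporting, not for further calculations.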
R is an object-oriented programming language. Values are assigned to objects with the assignment operator `<-`. The simplest example of assignment in R is the assignment of a single value to an object. This value can, e.g., be a single number or a character string.
```r
x <- 10
y <- "This is a character string"
x
## [1] 10
y
## [1] "This is a character string"
```
R objects in RStudio
Once one or more objects have been assigned values, they also appear in the Environment tab in RStudio.
R workspace
The Environment tab in RStudio shows the content of your current working environment (also called workspace), which includes any user-defined objects. The contents of the current environment are stored in the working memory (RAM) of your computer until you exit R (or RStudio).
Note: The fact that R objects are stored in your computer's RAM can become problematic if you work with "big data". However, there are solutions for working with larger-than-RAM data in R (such as disk.frame).
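The contents of the workspace can also be inspected and changed from the console or a script. The following sketch uses the base R functions `ls()` and `rm()` with the two example objects from above:

```r
x <- 10
y <- "This is a character string"

# list the names of all objects in the current environment
ls()

# remove a single object from the workspace
rm(x)

# check which of the two objects are still there
"x" %in% ls()
"y" %in% ls()
```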
R's memory use
In the newest versions of RStudio, the Environment tab includes a small icon that displays the system's overall memory use (displayed as a pie/donut chart) and the amount of RAM used by R (the number next to that).
From the dropdown menu next to that icon, you can also select Memory Usage Report to get more detailed information about current working memory (RAM) use.
R: Functions and packages
If you want to do anything in R, you need to use functions, and functions are provided through packages. We will go through the basics of functions and packages in R in the following.
Put simply, a function takes an input, does something with it, and produces some sort of output. Functions typically have arguments. In the simplest case, a function only requires an input (a value or object) as a single argument (some functions even require no argument).
```r
sqrt(9)
## [1] 3
x <- 9
sqrt(x)
## [1] 3
```
The output of a function can, of course, also be assigned to an object.
```r
x <- sqrt(9)
x
## [1] 3
```
Note: Technically, functions are also objects in R.
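As a brief illustration of that note, a function can be inspected and assigned to a new name like any other object. This is only a sketch; the name `square_root` is made up for the example:

```r
# functions can be inspected like any other object
class(sqrt)
is.function(sqrt)

# ... and assigned to a new name
square_root <- sqrt
square_root(9)
```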
Most functions in R have more than one argument.
```r
y <- "This is a character string"
# in the character string named y: replace i with X
gsub(pattern = "i", replacement = "X", y)
## [1] "ThXs Xs a character strXng"
```
If you want to know how to use a function, you can consult its help file. You can do that via the ? command:
```r
?gsub # ?function_name
```
In RStudio, this will open a file in the Help tab.
Functions can have required and optional arguments. Required arguments need to be specified for a function to run, whereas optional arguments have defaults and, hence, do not have to be provided in a function call. You can easily identify required and optional arguments in the Usage section of the help file for a function: If the argument is in the format argument = value it is optional. If only the argument name is provided function(argument_1), this means that this argument is required.
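The same information can also be obtained without opening the help file. The sketch below uses the base R functions `args()` and `formals()` on `gsub()`; the object name `gsub_arguments` is only illustrative:

```r
# print the argument list of gsub() in the console
args(gsub)

# formals() returns the arguments as a list; optional ones carry a default value
gsub_arguments <- formals(gsub)
names(gsub_arguments)

# the default value of an optional argument
gsub_arguments$ignore.case
```

Here, `pattern`, `replacement`, and `x` have no default and are therefore required, while the remaining arguments are optional.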
Function arguments can be provided in the specified order or by referencing them by name (in which case the order can change). For example, the following two versions of the gsub function call are both valid.
```r
y <- "This is a character string"
gsub("i", "X", y)
## [1] "ThXs Xs a character strXng"
gsub(y, replacement = "X", pattern = "i")
## [1] "ThXs Xs a character strXng"
```
Typing the argument names is more work but it increases the comprehensibility of your code for human readers.
If you want to understand the "inner workings" of a function (or maybe use code from existing functions for writing your own functions), you can also print the function body by just running the function name without the parentheses behind it.
```r
gsub
## function (pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
##     fixed = FALSE, useBytes = FALSE)
## {
##     if (is.factor(x) && length(levels(x)) < length(x)) {
##         gsub(pattern, replacement, levels(x), ignore.case, perl,
##             fixed, useBytes)[x]
##     }
##     else {
##         if (!is.character(x))
##             x <- as.character(x)
##         .Internal(gsub(as.character(pattern), as.character(replacement),
##             x, ignore.case, perl, fixed, useBytes))
##     }
## }
## <bytecode: 0x0000024b609c13c8>
## <environment: namespace:base>
```
class: center, middle
R packages
The key elements of the R universe are its packages. They essentially are collections of functions (and sometimes also datasets) and provide some form of documentation for those.
The basic R system as well as a huge number of additional packages that extend its functionalities are available via The Comprehensive R Archive Network (CRAN).
CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R (CRAN website).
base R
When we talk about base R, we typically refer to the set of packages that come with a new installation of R via CRAN.
There also is a package called base included with this but the base R system includes a number of other packages as well: utils, stats, datasets, graphics, grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, stats4.
In addition, a new installation also includes the following "recommended" packages: boot, class, cluster, codetools, foreign, KernSmooth, lattice, mgcv, nlme, rpart, survival, MASS, spatial, nnet, Matrix.
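If you want to check which of these packages are part of your own installation, the `priority` argument of the base R function `installed.packages()` can be used. A short sketch (the object names are illustrative, and the exact lists may differ slightly between R versions):

```r
# installed.packages() returns a matrix with one row per installed package
base_packages <- rownames(installed.packages(priority = "base"))
recommended_packages <- rownames(installed.packages(priority = "recommended"))

base_packages
recommended_packages
```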
CRAN provides an alphabetically sorted list with all available packages. You can search for your keywords of interest in that list, but that is not the most convenient option.
Two helpful resources for finding R packages:
CRAN Task Views provide curated lists of recommended packages for specific tasks/areas/topics
METACRAN allows you to search and browse all packages on CRAN
Of course, you can also use your search engine of choice and search for what you want to do plus "R package" (example: "ANOVA R package"), and we will introduce you to many useful packages for various purposes throughout this course.
Installing packages from CRAN in R is very straightforward.
```r
# Install a package
install.packages("correlation") # single or double quotation marks
# Install multiple packages at once
install.packages(c("correlation", "effectsize"))
```
R packages are installed in specific directories on your computer. NB: If you have multiple versions of R installed, there are directories for each version (with the exception of patch releases: e.g., 4.0.1 and 4.0.2 share the same folder for installed packages, whereas 4.0.0 and 4.1.0 do not). To find where packages are installed on your machine you can use the following command:
```r
.libPaths()
```
Once you have installed a package, you need to load it to be able to use the functions (and/or datasets) it contains in your R session.
```r
library(correlation) # no quotation marks needed
```
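As an alternative to `library()`, a single function can be called with the `package::function()` syntax without attaching the whole package. The sketch below uses the stats package that ships with R, and then shows one possible way to check whether a package is installed before loading it (the message text is made up for the example):

```r
# call a function from a package without attaching it via library()
stats::median(c(1, 3, 10))

# check whether a package is installed before trying to load it
if (requireNamespace("correlation", quietly = TRUE)) {
  library(correlation)
} else {
  message("Package 'correlation' is not installed; run install.packages(\"correlation\") first.")
}
```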
R packages
While it is the main source, not all packages for R are available via CRAN. Another important source of R packages, especially those that are still in early development, is GitHub. To be able to install packages hosted on GitHub you need to use functions from the devtools or the remotes package (which you need to install first as they do not come with base R). For example, if you want to install the RPG dice roll package that I mentioned before:
.small[
```r
# Option 1
library(devtools)
install_github("Felixmil/rollR") # last part of the GitHub URL (user name + repository name)

# Option 2
library(remotes)
install_github("Felixmil/rollR") # last part of the GitHub URL (user name + repository name)
```
]
Note: To be able to install packages from GitHub on Windows machines, you will need to install Rtools first.
There are a few packages that facilitate the installation and loading of R packages (from various sources). Two popular ones are:
You can get information about the packages you have installed on your system with the following function:
```r
installed.packages()
```
You can also use the Packages tab in the RStudio GUI to install, load, update, and uninstall packages. You can load a package by clicking the checkbox on the left side of its name. However, to make sure that you (and others) can reproduce what you have done, you should ideally include the installation and loading of packages as part of your R scripts.
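One possible way to include installation and loading in a script is sketched below. It is not the only approach; the vector `required_packages` is a placeholder that uses two packages shipped with R so that nothing needs to be downloaded when you try it out.

```r
# placeholder: replace with the packages your script needs
required_packages <- c("stats", "utils")

# which of the required packages are not installed yet?
missing_packages <- setdiff(required_packages, rownames(installed.packages()))

# install only what is missing ...
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}

# ... and then load everything
for (pkg in required_packages) {
  library(pkg, character.only = TRUE)
}
```

The `character.only = TRUE` argument is needed because the package name is passed as a character string stored in `pkg` rather than typed directly.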
R scripts
While the console is useful for trying things out, you should not use it for your actual data analysis. For this you should use R scripts that allow you to store and document your code. R scripts are similar to syntax files for SPSS or do-files for Stata. R scripts have the file extension .R.
In RStudio, you can create a new script via the menu (File -> New File -> R Script), by clicking the small white sheet icon with the green + symbol and choosing R Script, or through the keyboard shortcut Ctrl + Shift + N (Windows & Linux)/Cmd + Shift + N (Mac). You can open an existing script by clicking on it in the files tab, by clicking the open folder icon, via File -> Open File, or using the keyboard shortcut Ctrl + O (Windows & Linux)/Cmd + O (Mac).
When you open or create a script in RStudio this will be displayed in a fourth pane (which will have multiple tabs if you open/create more than one R script or other types of source files).
You can write your code in an R script just like you do in the console.
If you want to execute a single command from your script in RStudio, you can do so by placing your cursor somewhere in the command (or directly after it) and clicking the Run button in the menu or by using the keyboard shortcut Ctrl + Enter (Windows & Linux)/Cmd + Enter (Mac). This also works if you select multiple lines of code/commands.
You can also run all commands in your script by selecting Run all from the dropdown menu next to the Run button or via the keyboard shortcut Ctrl + Alt + R (Windows & Linux)/Cmd + Option + R (Mac).
You can save your script in RStudio via File -> Save or Save As..., by clicking the small blue floppy disk icon, or through the keyboard shortcut Ctrl + S (Windows & Linux)/Cmd + S (Mac).
To properly document your code (for your future self as well as other people who may use your code) it is good practice to use comments. In R scripts, you can create a comment by starting a line with a #.
In RStudio, to comment or uncomment one or more lines in a script you can also select them and use the keyboard shortcut Ctrl + Shift + C (Windows & Linux)/Cmd + Shift + C (Mac).
```r
# this is a comment
library(tidyverse)
```
R and RStudio
In the following slides, we will present some suggestions for adopting a couple of settings and practices that help you develop and implement workflows for R and RStudio that minimize mess and increase reproducibility.
In this session, we will only cover the basics that are necessary for establishing such workflows. If you are interested in some further information on setting up and maintaining your installation of R and RStudio as well as the optimization of workflows, and troubleshooting, you can check out the appendix slides with additional materials that we have created on these subjects.
Note: Most of the recommendations in the following (as well as in the additional materials) are largely based on the freely available online book What They Forgot to Teach You About R.
The working directory is where R will look for and save files by default.
You can check your current working directory with the following command:
```r
getwd()
```
In RStudio, the current working directory is also displayed at the top of the Console tab.
There are two ways in which you can set/change your working directory:
The RStudio menu Session -> Set Working Directory which provides different options:
"To Project Directory": can be used if you have an .Rproj file (more on that later)
"To Source File Location": sets the working directory to the location where the currently active source file - typically an R script - is stored
"To Files Pane Location": sets the working directory to the directory that is currently visible in the Files tab
"Choose Directory": opens a file browser window that lets you choose a directory
To increase the reproducibility of your work, however, using functions in scripts is generally the better approach.
You can set a working directory with the following command (of course, you need to replace the file path with the correct one for your system):
```r
setwd("C:/Users/user/Documents/analysis")
```
There are absolute (example: "C:/Users/user/Documents/example.R") and relative file paths (example: "./r-scripts/example.R"). Relative file paths are relative to the current working directory. Common shorthand options for relative file paths are . for the current (working) directory, .. for one folder level up (parent folder), and ~ for the home directory (which is the default working directory in R).
To facilitate the reuse of your code on other systems (by you or others), it is generally preferable to use relative file paths.
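A small sketch of working with a relative path in base R follows. The folder and file names are the example ones from above and are assumed rather than guaranteed to exist on your system:

```r
# build a relative path from its parts; file.path() inserts "/" as the separator
script_path <- file.path(".", "r-scripts", "example.R")
script_path

# check whether the file exists relative to the current working directory
file.exists(script_path)

# show the absolute path that the relative path resolves to
normalizePath(script_path, mustWork = FALSE)
```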
Note: R uses Unix-style file paths with /, while Windows uses \ in file paths. However \\ also works in R. There is a Stackoverflow post discussing several ways of dealing with that. A helpful tool in this context is Path Copy Copy which is an add-on for the Windows file explorer that lets you copy file paths in different formats.
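The following sketch shows two ways of handling a path copied from the Windows explorer, using only base R. The path itself is a made-up example, and raw strings require R 4.0.0 or newer:

```r
# a Windows path needs doubled backslashes inside a regular R string
windows_path <- "C:\\Users\\user\\Documents\\analysis"

# raw strings (available since R 4.0.0) allow pasting the path unchanged
raw_path <- r"(C:\Users\user\Documents\analysis)"
identical(windows_path, raw_path)

# convert the backslashes to forward slashes
unix_style_path <- gsub("\\", "/", windows_path, fixed = TRUE)
unix_style_path
```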
There are quite a few features of RStudio that can make your life as an R user much easier. We will briefly discuss four of them in the following:[1]
RStudio projects
Keyboard shortcuts
Autocomplete for code
Customization options
.footnote[ [1] If you want to discover some more of the benefits of using RStudio, you can check out the appendix slides.]
RStudio projects are a helpful tool for developing a project-oriented workflow that can enhance reproducibility.
You can create a project via the RStudio menu: File -> New Project. RStudio projects are associated with .Rproj files that contain some specific settings for the project. If you double-click on a .Rproj file, this opens a new instance of RStudio with the working directory and file browser set to the location of that file (the repository/folder for this workshop contains an .Rproj file, if you want to try this out).
Explaining RStudio projects in detail is beyond the scope of this course, but there are good tutorials available, e.g., on the RStudio support site or in the respective chapter in What They Forgot to Teach You About R.
RStudio offers a wide range of useful keyboard shortcuts. You can access a Keyboard Shortcut Quick Reference in RStudio via Help -> Keyboard Shortcuts Help. There even is a keyboard shortcut for accessing the keyboard shortcuts help (very meta): Alt + Shift + K (Windows & Linux)/Option + Shift + K (Mac).
One RStudio keyboard shortcut that is particularly helpful for writing R code is the one for the assignment operator: Alt + - (Windows & Linux)/Option + - (Mac).
Once you start typing a command in RStudio (in the console or a script), RStudio will make autocomplete suggestions (for functions but also other objects). You can cycle through these suggestions using the ↑ and ↓ keys on your keyboard. If you move your mouse cursor to one of the suggestions, RStudio displays an excerpt from the help file of that function. You can accept a suggestion by selecting it and pressing Tab.
By default, R stores your workspace and command history when closing a session (and also restores the former upon startup). While this can be helpful, it creates files that you probably will not use, and it can also be a barrier to adopting reproducible workflows.[1]
To avoid that, there are some general settings in RStudio that you might want to change via Tools -> Global Options -> General.
.footnote[ [1] Again, if you want to know more, have a look at the appendix slides.]
use R scripts to store your code
save/export important output in appropriate file formats (more on that in the following session on Data Import & Export)
(try to) use relative file paths in your scripts
eventually consider adopting a project-based workflow (using .Rproj files)
In case you get an error message or if your R session crashes, there are a couple of things you can do/try out:
copy the error message into your preferred search engine
abort the R process: Session -> Terminate R in the RStudio menu or by clicking the stop sign icon in the upper right corner of the console
Restart R (RStudio menu: Session -> Restart R) or RStudio
re-install packages
.center[ Source: https://s.unhb.de/DqKxb]
typos (e.g., capitalization in package names)
missing or unmatched (, ', or " (often at the end of a command)
\ instead of / in file paths (e.g., when copied from the Windows explorer)
packages not installed or loaded
code (chunks) executed in the wrong order
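To illustrate the first item in the list above: R is case-sensitive, so a name with different capitalization refers to a different (and here non-existent) object. The sketch below catches the resulting error with base R's `tryCatch()` so that the message can be inspected; the object names are made up, and the wording of the message depends on the language settings of your R installation.

```r
my_value <- 10

# My_Value is not the same object as my_value, so this expression raises an error
result <- tryCatch(
  My_Value + 1,
  error = function(e) conditionMessage(e)
)

# result now holds the text of the error message
result
```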
.center[ GIF by Allison Horst]
class: center, middle
Check out appendix slides with additional materials for this session
Watch the talk by David Smith on the history of R on YouTube
Explore the #rstats hashtag on Twitter